10/01/2022

A very brief history of R

  • R is an implementation of S language, created in 1976.
  • A modified version of S (S-PLUS) was developed in 1993.
  • Developed in 1991 by Ross Ihaka and Robert Gentleman and became free in 1995






Why R?

  • Open source
  • Reasonable fast
  • Operators for array (eg. matrix) analysis
  • Active development (packages)
  • Active user community
  • Several packages for data visualization
  • Integration with distributed version control system

But R …

  • It’s not user friendly
  • Has no commercial support
  • It’s limited by RAM amount
  • Steep learning curve

How to use R

R can be downloaded here.

One of the most used ways to use R is through the free and open source (but there is also commercial version) Rstudio Desktop, which helps to control all the different entities in R.

Rstudio can be downloaded here.

Rstudio: First steps

Rstudio: Set the working directory

  • Create a folder with name “esercitazione”
  • Copy the file “gene_expression_data.txt” in the folder “esercitazione”
  • Open Rstudio
  • Go to Session –> Set Working directory –> Choose Directory…
  • Click Open and confirm

Rstudio: Set the working directory

  • Go to File –> New File –> R script
  • Go to File –> Save –> Put in the esercitazione folder with name es1.R
  • In the “Source Panel” type 3+9 and press Run
  • On a new line type 45/8 and press Run
  • You can use the shortcut ctrl + Enter (Windows/Linux) or CMD + Enter to run a single line
  • ctrl + shift + Enter (Windows/Linux) or CMD + shift + Enter run all the code

R operations:

75+4
## [1] 79
41-9
## [1] 32
99/12
## [1] 8.25
17*6
## [1] 102

R operations:

9^3
## [1] 729
27%%3
## [1] 0
44%%7
## [1] 2

R logical operations:



10e4==100000
24!=34
15>20
30<21

R logical operations:


10e4==100000
## [1] TRUE
24!=34
## [1] TRUE

R logical operations:


15>20
## [1] FALSE
30<21
## [1] FALSE

R characters:

R is also able to manage character variables which must be surronded by “” or ’ ’


"Hello world!"
## [1] "Hello world!"
"I am a string"
## [1] "I am a string"

Vectors, objects and assignment

Vectors:

  • Numeric vectors
c(10, 30, 50,12)
## [1] 10 30 50 12

Note the c() function which means ‘concatenate’

  • Character vectors
c("I", "am", "a","string vector")
## [1] "I"             "am"            "a"             "string vector"

Assign variable to objects

Variables can be assigned to object using the “<-” or “=” symbol.

marks <- c(26, 30, 28,19)

basket <- c("peach", "orange", "cherries", "pineapple")

The content of any object can be shown just typing its name.

marks
## [1] 26 30 28 19
basket
## [1] "peach"     "orange"    "cherries"  "pineapple"

Classes : numeric and characters

We can check the class of any object with the function class(objectname)

class(marks)
## [1] "numeric"
class(basket)
## [1] "character"

Classes: Factors

Factors are variables in R which take on a limited number of different values; They are therefore Categorical, referring to categories that are called levels in R syntax.

myBiologicalType <- c("coding","UTR","intron","coding",
                      "UTR","regulatory","UTR")
myBiologicalType <- as.factor(myBiologicalType)
levels(myBiologicalType)
## [1] "coding"     "intron"     "regulatory" "UTR"

Classes: Factors

Although it’s not initially obvious, factors are very important. Consider, for example a vector of 100,000 entries summarizing the sex of indivuduals for a given species.

library(pryr)
sexes <- sample(c("male","female"),size = 100000,replace = T)
length(sexes)
## [1] 100000
object_size(sexes)
## 800,160 B
object_size(as.factor(sexes))
## 400,560 B

Working on vectors

In R, objects can be modified in many ways:

let’s create two vectors:

vec1 <- 1:10 # What am I doing here?

vec2 <- seq(from=100,to = 1000,by = 100) # And here?

Working on vectors

The length of the vector can be retrieved through the length() function:

length (vec1)
## [1] 10
length(vec2)
## [1] 10

Working on vectors: manipulation

we can subset vectors indicating the index in square bracket

vec2 #show the full vector
##  [1]  100  200  300  400  500  600  700  800  900 1000
vec2[3] # show the third element of the vector
## [1] 300
vec2[8] # show the eighth element of a vector
## [1] 800

Working on vectors :manipulation

we can subset vectors indicating the index in square bracket

vec2 #show the full vector
##  [1]  100  200  300  400  500  600  700  800  900 1000
vec2[3:6] # show FROM the third element TO the sixth element of the vector
## [1] 300 400 500 600
vec2[c(2,5,9)] # show the 2nd,5th and 9th element
## [1] 200 500 900

Working on vectors :manipulation

We can replace values in vectors using this syntax:

vec2 #show the full vector
##  [1]  100  200  300  400  500  600  700  800  900 1000
vec2[3] # show the third element
## [1] 300
vec2[3] <- 450  # replace the 3rd elemnt with 450
vec2[3]
## [1] 450

Working on vectors: operations

When working with vectors it is possible to apply an operation to all the variables at once:

vec2
##  [1]  100  200  450  400  500  600  700  800  900 1000
vec2/2  # replace the 3rd elemnt with 450
##  [1]  50 100 225 200 250 300 350 400 450 500
vec2-30
##  [1]  70 170 420 370 470 570 670 770 870 970

Working on vectors: operations

When working with vectors it is possible to apply an operation to all the variables at once:

vec2
##  [1]  100  200  450  400  500  600  700  800  900 1000
vec2*2
##  [1]  200  400  900  800 1000 1200 1400 1600 1800 2000
vec2+50
##  [1]  150  250  500  450  550  650  750  850  950 1050

Working on vectors: operations

We can also perform operations between vectors:

vec2 #show vec2
##  [1]  100  200  450  400  500  600  700  800  900 1000
vec1 # show vec1
##  [1]  1  2  3  4  5  6  7  8  9 10
vec1*vec2 # Multiply vec1 and vec2
##  [1]   100   400  1350  1600  2500  3600  4900  6400  8100 10000

Working on vectors: operations

We can also perform operations between vectors:

vec2 #show vec2
##  [1]  100  200  450  400  500  600  700  800  900 1000
vec1 # show vec1
##  [1]  1  2  3  4  5  6  7  8  9 10
vec1/vec2 # Divide vec2 and vec1
##  [1] 0.010000000 0.010000000 0.006666667 0.010000000 0.010000000 0.010000000
##  [7] 0.010000000 0.010000000 0.010000000 0.010000000

Working on vectors: operations

We can also perform operations between vectors:

vec2 #show vec2
##  [1]  100  200  450  400  500  600  700  800  900 1000
vec1 # show vec1
##  [1]  1  2  3  4  5  6  7  8  9 10
vec1+vec2 # sum vec1 and vec2
##  [1]  101  202  453  404  505  606  707  808  909 1010

Working on vectors: operations

We can also perform operations between vectors:

vec2 #show vec2
##  [1]  100  200  450  400  500  600  700  800  900 1000
vec1 # show vec1
##  [1]  1  2  3  4  5  6  7  8  9 10
vec1-vec2 # Subtract vec2 to vec1
##  [1]  -99 -198 -447 -396 -495 -594 -693 -792 -891 -990

Working on vectors: operations

WARNING!

When performing operations between vectors, the elements of the shorter one are recycled.

v1 <- 1:5
v2 <- c(2,3)

v1*v2 # What is going on here?
## Warning in v1 * v2: longer object length is not a multiple of shorter object
## length
## [1]  2  6  6 12 10

Working on vectors: recycling


Working on vectors: recycling


Working on vectors: recycling


Data visualization : Scatterplot

x <- 1:10 # create a vector from 1 to 10
y <- 51:60 # create a vectro from 51:60
plot(x,y) #plot x and y

Data visualization : Scatterplot

x <- 1:10 # create a vector from 1 to 10
y <- 51:60 # create a vector from 51:60
plot(x,y) # plot x and y
lines(x,y,col="red") #add a red line 

Data visualization : histogram

x <- rnorm(1000,mean=10) # create random normal 
                         # distribution of 1,000 samples with mean=10

hist(x)

Matrices and data.frames

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    7   24   13   87   95   50   72   95
## [2,]    6   44   40   21   34   34   92   70
## [3,]    4   83   24   26   62   19   33    9
## [4,]   55   84   35   52   55   84   61   92
## [5,]    7   37   31   68   28   32   29    5
  • Matrices are R objects in which data are organised in a 2-dimensional array.
  • they can contain only elements of the same types (eg. only numbers, only characters, etc.).
  • A matrix is created through the matrix() function.

Matrices

m=matrix(1:100,nrow=10)
m
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1   11   21   31   41   51   61   71   81    91
##  [2,]    2   12   22   32   42   52   62   72   82    92
##  [3,]    3   13   23   33   43   53   63   73   83    93
##  [4,]    4   14   24   34   44   54   64   74   84    94
##  [5,]    5   15   25   35   45   55   65   75   85    95
##  [6,]    6   16   26   36   46   56   66   76   86    96
##  [7,]    7   17   27   37   47   57   67   77   87    97
##  [8,]    8   18   28   38   48   58   68   78   88    98
##  [9,]    9   19   29   39   49   59   69   79   89    99
## [10,]   10   20   30   40   50   60   70   80   90   100

Matrices

m <- matrix(1:100,nrow=10,byrow = T)
m
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1    2    3    4    5    6    7    8    9    10
##  [2,]   11   12   13   14   15   16   17   18   19    20
##  [3,]   21   22   23   24   25   26   27   28   29    30
##  [4,]   31   32   33   34   35   36   37   38   39    40
##  [5,]   41   42   43   44   45   46   47   48   49    50
##  [6,]   51   52   53   54   55   56   57   58   59    60
##  [7,]   61   62   63   64   65   66   67   68   69    70
##  [8,]   71   72   73   74   75   76   77   78   79    80
##  [9,]   81   82   83   84   85   86   87   88   89    90
## [10,]   91   92   93   94   95   96   97   98   99   100

Matrices

class(m) # shows the class of the object m
## [1] "matrix" "array"
is.matrix(m) # shows TRUE if m is a matrix, FALSE otherwise
## [1] TRUE
dim(m) # shows the number of rows and column of object m
## [1] 10 10
nrow(m) # shows the number of rows ob object m
## [1] 10
ncol(m) # shows the number of columns of object m
## [1] 10

Matrices: data extraction

m[3,] # extract the third row 
##  [1] 21 22 23 24 25 26 27 28 29 30
m[,1] # extract the first column
##  [1]  1 11 21 31 41 51 61 71 81 91

Matrices: data extraction

# extract the values of the 2nd and third row 
# in the 5th, 6th and 9th column
m[2:3,c(5,6,9)] 
##      [,1] [,2] [,3]
## [1,]   15   16   19
## [2,]   25   26   29
m[6,7] # extract the value of the 6th row and senth column
## [1] 57

Dataframes

An important R class is dataframe, which is characterised by high flexibility. In fact despite matrices, dataframes can contain multiple types of data

gene <-  c("ADGRB2", "TARBP1", "SLC22A24", "SLC15A5", 
           "LOC117975386", "C17H17orf99", "TJP3", "TRIB3")
expression_in_cases <-c(100, 30, 35, 66,90, 200,12,8) 
expression_in_controls <-c(10, 45, 55, 70,110, 80,25,85) 

We can create a data.frame using the data.frame() function:

df <- data.frame(gene,expression_in_cases,expression_in_controls)

dataframes allow for data extraction using the same notations we saw before for matrices So, what the code needed to show only the first two rows of the data.frame?

Dataframes

An important R class is dataframe, which is characterised by high flexibility. In fact despite matrices, dataframes can contain multiple types of data

gene <-  c("ADGRB2", "TARBP1", "SLC22A24", "SLC15A5", "LOC117975386", "C17H17orf99", "TJP3", "TRIB3")
expression_in_cases <-c(100, 30, 35, 66,90, 200,12,8) 
expression_in_controls <-c(10, 45, 55, 70,110, 80,25,85) 

We can create a data.frame using the data.frame() function:

df <- data.frame(gene,expression_in_cases,expression_in_controls)

So, what the code needed to show only the first two rows of the data.frame?

df[1:2,]
##     gene expression_in_cases expression_in_controls
## 1 ADGRB2                 100                     10
## 2 TARBP1                  30                     45

Dataframes: data extraction

df[1:2,]
##     gene expression_in_cases expression_in_controls
## 1 ADGRB2                 100                     10
## 2 TARBP1                  30                     45

dataframes also allow to select specific columns usig the $ notations

df$gene #show the column "gene"
## [1] "ADGRB2"       "TARBP1"       "SLC22A24"     "SLC15A5"      "LOC117975386"
## [6] "C17H17orf99"  "TJP3"         "TRIB3"
df$expression_in_cases # show the column "expression_in_cases" 
## [1] 100  30  35  66  90 200  12   8

Dataframes: transformation

A dataframe can be converted in matrix using the function as.matrix()

m <- as.matrix(df)
class(m)
## [1] "matrix" "array"

We can convert a matrix to a datafram using as.dataframe()

df <- as.data.frame(m)
class(df)
## [1] "data.frame"

Dataframes: transformation

We can also explore the structure of the object using str()

str(m) # show the structure of the object m
##  chr [1:8, 1:3] "ADGRB2" "TARBP1" "SLC22A24" "SLC15A5" "LOC117975386" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:3] "gene" "expression_in_cases" "expression_in_controls"
str(df) # show the structure of the object df
## 'data.frame':    8 obs. of  3 variables:
##  $ gene                  : chr  "ADGRB2" "TARBP1" "SLC22A24" "SLC15A5" ...
##  $ expression_in_cases   : chr  "100" " 30" " 35" " 66" ...
##  $ expression_in_controls: chr  " 10" " 45" " 55" " 70" ...

R data structures: Overview

Here is a brief overview of some fundamental data structure in R

object modes Allow_multiple_modes
vector numeric, character, complex, logical No
matrix numeric, character, complex, logical No
dataframe numeric, character, complex, logical yes

R: Challenge yourself


  1. Open Rstudio
  2. Type set.seed(123)
  3. Create a vector m of length 20, composed by numbers drawn by a normal distribution with mean=0 and sd=1
  4. Create a vector n with the first 5 elements from the vector m
  5. Plot an histogram of m

R: Challenge yourself


  1. Create a matrix mat composed by 10 rows and 10 columns with numbers from 201…300
  2. Replace the value located at the second row, third column with 84
  3. Plot a scatterplot of the second column of mat in x axis and the and fifth column of mat on the y axis
  4. Convert the matrix mat into a dataframe df
  5. Show the column “V2” of the dataframe df using the $ notation

R: Challenge yourself

set.seed(123) #2. Type ```set.seed(123)```
m=rnorm(20,mean=0,sd=1) # 3. Create a vector **m** of length 20, composed by numbers drawn by a normal distribution with mean=0 and sd=1
n=m[1:5] # Create a vector **n** with the first 5 elements from the vector **m**
hist(m) # 5. Plot an histogram of **m**

R: Challenge yourself

# 6. Create a matrix **mat** composed 
#by 10 rows and 10 columns with numbers from 201...300
mat <- matrix (data = 201:300,nrow=10,ncol=10) 
# 7. Replace the value located 
# at the second row, third column with 84
mat[2,3] <- 84  

R: Challenge yourself

# 8. Plot a *scatterplot* of the second column of
# mat in x axis and the and fifth column of mat on the y axis
plot(mat[,2],mat[,5]) 

R: Challenge yourself

#9. Convert the matrix **mat** into a dataframe **df**
df=as.data.frame(mat) 
#10. Show the column "V2" of the dataframe *df* using the **$** notation
df$V2  
##  [1] 211 212 213 214 215 216 217 218 219 220

R: working with real datasets

We are now ready to work on a real problem.

Let’s assume you got 1000 euros to do a pilot study on the genetic basis of drug addiction.

I want to be quick and efficient, and I need to prioritise my research towards genes that are ACTUALLY relate to drug addiction.

After hours and hours of literature review I found a list of 12 candidate genes, and their expression at 54 different tissues. They are saved on the file “gene_expression_data.txt”

R: working with real datasets

  1. Open Rstudio
  2. Set the working directory to “esercitazione”
  3. Let’s read the “gene_expression_data.txt” file in R
gene_expression <- read.table(file = "./data/gene_expression_data.txt",
                              header = T,sep = "\t",check.names = F)

R: working with real datasets

  1. Let’s have a look at the dataset
gene_expression[1:5,1:5]
##                 Name Description Adipose - Subcutaneous
## 1  ENSG00000114166.7       KAT2B               23.67930
## 2 ENSG00000112038.17       OPRM1                0.00000
## 3 ENSG00000170312.15        CDK1                4.05339
## 4 ENSG00000138162.18       TACC2               10.08120
## 5 ENSG00000189319.13      FAM53B               16.55860
##   Adipose - Visceral (Omentum) Adrenal Gland
## 1                     20.10720       9.56261
## 2                      0.00000       0.00000
## 3                      2.76667       1.45045
## 4                      8.07309       4.56019
## 5                     13.62330      25.18740

R: working with real datasets

  1. Show the list of genes in the dataset

R: working with real datasets

  1. Show the list of genes in the dataset
gene_expression$Description
##  [1] "KAT2B"       "OPRM1"       "CDK1"        "TACC2"       "FAM53B"     
##  [6] "FAM53B-AS1"  "BRSK2"       "ANKS1B"      "MYH10"       "CCDC42"     
## [11] "SLC14A2"     "SLC14A2-AS1"

R: working with real datasets

  1. Show the list of tissues in the dataset
colnames(gene_expression)[-c(1:2)]
##  [1] "Adipose - Subcutaneous"                   
##  [2] "Adipose - Visceral (Omentum)"             
##  [3] "Adrenal Gland"                            
##  [4] "Artery - Aorta"                           
##  [5] "Artery - Coronary"                        
##  [6] "Artery - Tibial"                          
##  [7] "Bladder"                                  
##  [8] "Brain - Amygdala"                         
##  [9] "Brain - Anterior cingulate cortex (BA24)" 
## [10] "Brain - Caudate (basal ganglia)"          
## [11] "Brain - Cerebellar Hemisphere"            
## [12] "Brain - Cerebellum"                       
## [13] "Brain - Cortex"                           
## [14] "Brain - Frontal Cortex (BA9)"             
## [15] "Brain - Hippocampus"                      
## [16] "Brain - Hypothalamus"                     
## [17] "Brain - Nucleus accumbens (basal ganglia)"
## [18] "Brain - Putamen (basal ganglia)"          
## [19] "Brain - Spinal cord (cervical c-1)"       
## [20] "Brain - Substantia nigra"                 
## [21] "Breast - Mammary Tissue"                  
## [22] "Cells - Cultured fibroblasts"             
## [23] "Cells - EBV-transformed lymphocytes"      
## [24] "Cervix - Ectocervix"                      
## [25] "Cervix - Endocervix"                      
## [26] "Colon - Sigmoid"                          
## [27] "Colon - Transverse"                       
## [28] "Esophagus - Gastroesophageal Junction"    
## [29] "Esophagus - Mucosa"                       
## [30] "Esophagus - Muscularis"                   
## [31] "Fallopian Tube"                           
## [32] "Heart - Atrial Appendage"                 
## [33] "Heart - Left Ventricle"                   
## [34] "Kidney - Cortex"                          
## [35] "Kidney - Medulla"                         
## [36] "Liver"                                    
## [37] "Lung"                                     
## [38] "Minor Salivary Gland"                     
## [39] "Muscle - Skeletal"                        
## [40] "Nerve - Tibial"                           
## [41] "Ovary"                                    
## [42] "Pancreas"                                 
## [43] "Pituitary"                                
## [44] "Prostate"                                 
## [45] "Skin - Not Sun Exposed (Suprapubic)"      
## [46] "Skin - Sun Exposed (Lower leg)"           
## [47] "Small Intestine - Terminal Ileum"         
## [48] "Spleen"                                   
## [49] "Stomach"                                  
## [50] "Testis"                                   
## [51] "Thyroid"                                  
## [52] "Uterus"                                   
## [53] "Vagina"                                   
## [54] "Whole Blood"

R: working with real datasets

One way to decide to which gene put more effort (and money) I will look their expression in different brain tissues using the barplot(). I will focus on the following tissues




  • Brain - Cerebellum
  • Brain - Cortex
  • Brain - Hippocampus
  • Brain - Hypothalamus

R: working with real datasets

Brain - Cerebellum

barplot(gene_expression$`Brain - Cerebellum`,
        names.arg = gene_expression$Description,las=2, 
        col="blue",cex.names = .8,ylab="transcript per million")

R: working with real datasets

Brain - Cortex

barplot(gene_expression$`Brain - Cortex`,
        names.arg = gene_expression$Description,las=2, 
        col="green",cex.names = .8,ylab="transcript per million")

R: working with real datasets

Brain - Hippocampus

barplot(gene_expression$`Brain - Hippocampus`,
        names.arg = gene_expression$Description,las=2, 
        col="red",cex.names = .8,ylab="transcript per million")

R: working with real datasets

Brain - Hypothalamus

barplot(gene_expression$`Brain - Hypothalamus`,
        names.arg = gene_expression$Description,las=2, 
        col="purple",cex.names = .8,ylab="transcript per million")

R: Wrap it all- Which gene is a good candidate to study?



par(mfrow=c(2,2),mar=c(6,2.5,1.5,1))

barplot(gene_expression$`Brain - Cerebellum`,
        names.arg = gene_expression$Description,las=2, 
        col="blue",main="Cerebellum")
barplot(gene_expression$`Brain - Cortex`,
        names.arg = gene_expression$Description,las=2, 
        col="green",main="Cortex")
barplot(gene_expression$`Brain - Hippocampus`,
        names.arg = gene_expression$Description,las=2, 
        col="brown",main="Hippocampus")
barplot(gene_expression$`Brain - Hypothalamus`,
        names.arg = gene_expression$Description,las=2, 
        col="purple",main="Hypotalamus")

R: Wrap it all- Which gene is a good candidate to study?